Background

Precision treatment in hematologic malignancies relies on rapid, accurate interpretation of tumor genomic data, yet current workflows lag behind the surge in published literature. Manually curated databases such as OncoKB and CIViC, which annotate thousands of clinically meaningful variants, keep expanding (Chakravarty et al, 2017; Griffith et al, 2017). However, clinicians still devote substantial time to database searches, need up to 30 minutes to classify a single variant, and view few unsolicited genetic results in the electronic health record (EHR) (Chin et al, 2022; Nestor et al, 2021). Fragmented schemas, rigid search syntax, paywalls, divergent evidence grading systems, and poor data concordance further impede bedside use. These gaps demand automated clinician decision support. We developed the Oncology Dynamic Integrated Framework (OncoDIF) to transparently extract evidence from the literature into a curated knowledge base, OncoDIF-KB, using advanced artificial intelligence (AI) methods. OncoDIF enables natural language querying of the knowledge base and delivers guideline-aligned reports in real time.

Methods

OncoDIF includes three integrated modules.

(1) A DSPy-based ReAct Extraction Engine processes PDFs using omniOCR and applies a GPT-4.1-powered agentic system to extract evidence through a four-step loop: Reason, Act, Think, Conclude. The system was trained, and its prompts were optimized, on the open-source, human-curated CIViC dataset. Each step is logged as a reasoning trace for transparency. Extracted items undergo structured validation and are assigned a confidence score; low-confidence outputs are flagged for human review.
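
The validation and review-flagging step can be sketched as follows. This is a minimal illustration only: the abstract does not specify the evidence-item schema or the confidence cutoff, so the field names in REQUIRED_FIELDS, the REVIEW_THRESHOLD value, and the helper names are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical required fields and cutoff; the abstract specifies neither.
REQUIRED_FIELDS = ("gene", "variant", "disease", "evidence_type")
REVIEW_THRESHOLD = 0.8


@dataclass
class ExtractedItem:
    fields: dict       # structured output of the extraction step
    confidence: float  # assumed to lie in [0, 1]


def validate(item):
    """Return a list of validation problems (empty if the item is well formed)."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if not item.fields.get(name)
    ]
    if not 0.0 <= item.confidence <= 1.0:
        problems.append("confidence outside [0, 1]")
    return problems


def triage(item):
    """Route an item to 'accept' or to 'human_review'."""
    if validate(item) or item.confidence < REVIEW_THRESHOLD:
        return "human_review"
    return "accept"
```

Under these assumptions, an item is accepted only when it is both structurally complete and above the cutoff; anything else is queued for a curator.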

(2) A Natural-Language-to-SQL Translator, a LoRA-tuned Llama-3 model, was trained on 21,000 synthetic question–query pairs derived from the CIViC dataset, enabling clinicians to query OncoDIF-KB in plain language.
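
One plausible way to build such synthetic question–query pairs is template filling, sketched below. The abstract does not describe how the pairs were generated, so the template approach itself, the `evidence` table, its column names, and the gene list are all assumptions made for illustration.

```python
import itertools

# Hypothetical question/SQL templates over an assumed `evidence` table.
TEMPLATES = [
    ("List all the variants found in {gene}",
     "SELECT DISTINCT variant FROM evidence WHERE gene = '{gene}';"),
    ("Which therapies have evidence for {gene}?",
     "SELECT DISTINCT therapy FROM evidence WHERE gene = '{gene}';"),
]

# Illustrative gene symbols; a real run would draw values from the curated data.
GENES = ["BRAF", "FLT3", "GPRC5D"]


def make_pairs(templates, genes):
    """Fill every template with every gene, yielding question/SQL training pairs."""
    return [
        {"question": question.format(gene=gene), "sql": sql.format(gene=gene)}
        for (question, sql), gene in itertools.product(templates, genes)
    ]


pairs = make_pairs(TEMPLATES, GENES)
```

The SQL strings here are training targets rather than statements to execute, which is why the values are interpolated directly into the text.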

(3) A Knowledge-Enhanced Interface harmonizes >15,000 synonyms for genes, variants, diseases, and drugs; reformulates failed queries through multi-step reasoning; and presents results via a Clinical Interpretation Engine that generates NCCN-aligned summaries with evidence levels, response metrics, and trial suggestions.
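
Synonym harmonization with a fallback for misspelled terms can be sketched as below. The lookup table is a tiny illustrative stand-in for the >15,000 synonyms mentioned above, and fuzzy matching with the standard-library `difflib` module is an assumption; the abstract does not state which matching method OncoDIF uses.

```python
import difflib

# Tiny illustrative synonym table (lowercased surface form -> canonical name).
SYNONYMS = {
    "braf": "BRAF",
    "b-raf": "BRAF",
    "gprc5d": "GPRC5D",
    "multiple myeloma": "Multiple Myeloma",
    "myeloma": "Multiple Myeloma",
    "venetoclax": "Venetoclax",
    "venclexta": "Venetoclax",
}


def harmonize(term, cutoff=0.8):
    """Map a user-supplied term to its canonical name, or None if nothing matches."""
    key = " ".join(term.lower().split())  # normalize case and whitespace
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Fall back to approximate matching for misspellings.
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[close[0]] if close else None
```

Exact lookup is tried first so that known synonyms never depend on the fuzzy cutoff; the cutoff value of 0.8 is an arbitrary choice for this sketch.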

Results

OncoDIF automatically extracts clinically relevant variants from the literature and stores them in a knowledge base (OncoDIF-KB) in a CIViC-compatible format. A Knowledge-Enhanced Interface allows users to query OncoDIF-KB in natural language; each question is internally translated into SQL to retrieve the requested information, with nomenclature and synonyms handled accurately. Clinicians can ask simple queries such as “List all the variants found in BRAF” or complex queries requiring sophisticated data handling such as “Analyze therapy response patterns across diseases for variants with > 3 associated clinical trials.”
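
As a rough illustration of what the simple query above could become after translation, the sketch below runs an equivalent SQL statement against an in-memory SQLite database. The `evidence` table, its columns, and the three rows are hypothetical stand-ins and do not reflect the actual OncoDIF-KB schema or contents.

```python
import sqlite3

# Hypothetical single-table schema with a few illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evidence (gene TEXT, variant TEXT, disease TEXT, therapy TEXT)"
)
conn.executemany(
    "INSERT INTO evidence VALUES (?, ?, ?, ?)",
    [
        ("BRAF", "V600E", "Hairy Cell Leukemia", "Vemurafenib"),
        ("BRAF", "V600K", "Melanoma", "Dabrafenib"),
        ("FLT3", "ITD", "Acute Myeloid Leukemia", "Midostaurin"),
    ],
)


def variants_in_gene(conn, gene):
    """SQL a translator might emit for 'List all the variants found in <gene>'."""
    rows = conn.execute(
        "SELECT DISTINCT variant FROM evidence WHERE gene = ? ORDER BY variant",
        (gene,),
    )
    return [row[0] for row in rows]
```

Calling `variants_in_gene(conn, "BRAF")` on this toy data returns the two BRAF variants; the parameterized query keeps user-supplied gene names out of the SQL text.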

Evaluation of OncoDIF on 1,300 oncology articles (70% training, 30% validation) showed 95.2% precision and 93.8% recall. Full reasoning traces were retained for 97% of items, and throughput averaged 2 PDFs per minute, projecting a 10-fold reduction in curator hours. The natural-language-to-SQL translator produced correct SQL for 83.24% of unseen questions versus 55% for the untuned baseline, with a median latency of 7 seconds. In 200 simulated scenarios, overall query success climbed from 76% to 92% after synonym harmonization, and success for colloquial or misspelled terminology improved from 40% to 90%. No unsafe recommendations were generated, and every action was logged for audit review.

To further illustrate real-world impact, we focused on GPRC5D, an emerging CAR-T and bispecific target in multiple myeloma whose clinical evidence is sparsely represented in public databases. In under two minutes, OncoDIF automatically extracted five phase I/II trial records from a 2024 review, generated CIViC-formatted evidence items on resistance, and produced a one-page clinical report summarizing response rates, cytokine release syndrome (CRS) incidence, and trial eligibility. These tasks previously required several hours of expert curator time.

Conclusions

OncoDIF enables fast, automated curation of scientific knowledge from the literature and its interactive interrogation in natural language, while preserving the audit trails essential for regulatory adoption. By supplying guideline-aligned interpretations, the framework has the potential to accelerate treatment selection, expand genotype-matched trial enrollment, and improve outcomes for patients with hematologic malignancies.
